Learning outcomes

What is statistics?

Data analysis taking uncertainty into account

Probability theory

Random variables

The outcome of a random experiment can be described by a random variable.

Example random variables:

  • The weight of a random newborn baby
  • The smoking status of a random mother
  • The hemoglobin concentration in blood
  • The number of mutations in a gene
  • BMI of a random man
  • Weight status of a random man (underweight, normal weight, overweight, obese)
  • The result of throwing a die

Whenever chance is involved in the outcome of an experiment, the outcome is a random variable.

A random variable is usually denoted by a capital letter, \(X, Y, Z, \dots\). Values collected in an experiment are observations of the random variable, usually denoted by lowercase letters \(x, y, z, \dots\).

A random variable cannot be predicted exactly, but the probability of all possible outcomes can be described.

The population is the collection of all possible observations of the random variable. Note, the population is not always countable.

A sample is a subset of the population.

Discrete random variables

A discrete random variable can be described by its probability mass function.

  • The number of dots on a die
x     1     2     3     4     5     6
p(x)  0.17  0.17  0.17  0.17  0.17  0.17

Probability mass function of a die.

  • The smoking status of a random mother

The random variable has two possible outcomes: non-smoker (0) and smoker (1). The probability of a random mother being a smoker is 0.39.

      non-smoker  smoker
x     0           1
p(x)  0.61        0.39
  • The number of bacterial colonies on a plate

The probability that the random variable, \(X\), takes the value \(x\) is denoted \(P(X=x) = p(x)\). Note that:

  1. \(0 \leq p(x) \leq 1\), a probability is always between 0 and 1.
  2. \(\sum_x p(x) = 1\), the sum over all possible outcomes is 1.
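These two properties are easy to check numerically. A minimal sketch in base R, using the fair-die probability mass function from above:

```r
# The probability mass function of a fair die
x <- 1:6
p <- rep(1/6, 6)

# Property 1: every probability is between 0 and 1
all(p >= 0 & p <= 1)
## [1] TRUE

# Property 2: the probabilities sum to 1 (up to floating point precision)
sum(p)
## [1] 1

# P(X = 3)
p[x == 3]
## [1] 0.1666667
```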

Exercise: Dice experiment

When throwing 10 dice, how many dice show 6 dots?

Bernoulli trial

A Bernoulli trial is a random experiment with two outcomes; success and failure. The probability of success, \(P(success) = p\), is constant. The probability of failure is \(P(failure) = 1-p\).

When coding it is convenient to code success as 1 and failure as 0.

The outcome of a Bernoulli trial is a discrete random variable, \(X\).

x     0     1
p(x)  1-p   p
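A Bernoulli trial can be simulated in base R with rbinom using size = 1. A sketch, using p = 0.39 (the smoker probability from the example above) as an assumed success probability:

```r
# Simulate 1000 independent Bernoulli trials with success probability p
set.seed(1)
p <- 0.39
x <- rbinom(1000, size = 1, prob = p)

# Each outcome is 0 (failure) or 1 (success)
table(x)

# The observed fraction of successes is close to p
mean(x)
```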

Binomial distribution

The number of successes in a series of \(n\) independent and identical Bernoulli trials is also a discrete random variable:

\(Y = \sum_{i=1}^n X_i\)

The probability mass function of \(Y\) is called the binomial distribution.
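In R the binomial probability mass function is available as dbinom and its cumulative distribution function as pbinom. A sketch, connecting back to the dice exercise (the number of sixes among 10 dice is \(Bin(10, 1/6)\)):

```r
# Number of dice showing six dots when throwing 10 dice: Y ~ Bin(10, 1/6)
n <- 10
p <- 1/6

# P(Y = 2): exactly two sixes
dbinom(2, size = n, prob = p)
## [1] 0.29071

# P(Y <= 2): at most two sixes
pbinom(2, size = n, prob = p)

# The probability mass function sums to 1 over all possible outcomes 0..n
sum(dbinom(0:n, size = n, prob = p))
## [1] 1
```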

Continuous random variable

A continuous random variable can be described by its probability density function.

Example: baby weight

The data set babies consists of data for 1236 male babies and their mothers. All babies were born in Oakland in the 1960s.

The weight of a random newborn baby is a continuous random variable; let's call it \(W\). In this example the entire population is known and can be summarized in a histogram.

Probability density function, pdf

The probability density function, \(f(x)\), is defined such that the total area under the curve is 1.

\[ \int_{-\infty}^{\infty} f(x) dx = 1 \]

\(P(a \leq X \leq b) = \int_a^b f(x) dx\)

Cumulative distribution function, cdf

The cumulative distribution function, sometimes called just the distribution function, \(F(x)\), is defined as:

\[F(x) = P(X<x) = \int_{-\infty}^x f(t) dt\]

\[P(X<x) = F(x)\]

\[P(X \geq x) = 1 - F(x)\]

\[P(a \leq X < b) = F(b) - F(a)\]
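These identities can be checked numerically for any distribution with a known cdf; a sketch using the standard normal cdf pnorm:

```r
# P(a <= X < b) = F(b) - F(a), illustrated with the standard normal distribution
a <- -1
b <- 1
pnorm(b) - pnorm(a)
## [1] 0.6826895

# P(X >= a) = 1 - F(a)
1 - pnorm(a)
## [1] 0.8413447
```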

Probability

When the entire population is known, probabilities can be computed by counting the number of observations that fulfil the criterion and dividing by the total number of observations.

Weight distribution


library(UsingR)
##The weights are originally in ounces, transform to kg
ounce <- 0.0283495231
wt <- babies$wt*ounce
## P(W > 4.0)
## Count the number of babies with a weight > 4.0 kg
sum(wt>4)
## [1] 133
## How many babies in total
length(wt)
## [1] 1236
## Fraction of babies with weight > 4.0 kg, this is P(W>4.0)
sum(wt>4)/length(wt)
## [1] 0.11
## Another way to compute P(W>4.0)
mean(wt>4)
## [1] 0.11

Exercise:

Based on the babies population, compute the following probabilities

Smoking status of a random mother

## # A tibble: 5 x 4
##   smoke     n       p code                   
##   <dbl> <int>   <dbl> <chr>                  
## 1     0   544 0.440   never                  
## 2     1   484 0.392   smokes now             
## 3     2    95 0.0769  until current pregnancy
## 4     3   103 0.0833  once did, not now      
## 5     9    10 0.00809 unknown

Let \(S\) denote the smoking status of a random mother. The probability that a random mother never smoked is \(P(S=0) = p(0) = 0.4401\). Note that \(S\) is a discrete random variable.

Conditional probability

Compute the probability that a smoking mother has a baby with a weight below 2.6 kg.

\[P(W<2.6|S=1)\]

Compute the probability that a mother who never smoked has a baby with a weight below 2.6 kg.

\[P(W<2.6|S=0)\]

Diagnostic tests

             pos   neg   total
not cancer    98   882     980
cancer        16     4      20
total        114   886    1000
  • What is the probability of a positive test result from a person with cancer?
  • What is the probability of a negative test result from a person without cancer?
  • If the test is positive, what is the probability of having cancer?
  • If the test is negative, what is the probability of not having cancer?
  • Connect the four computed probabilities with the following four terms:
    • Sensitivity
    • Specificity
    • Positive predictive value (PPV)
    • Negative predictive value (NPV)
    Discuss in your group!
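One way to compute the four conditional probabilities in R (matching them to the four terms is left for the group discussion):

```r
# The diagnostic test table as a matrix
tab <- matrix(c(98, 882,
                16,   4),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("not cancer", "cancer"), c("pos", "neg")))

# P(pos | cancer): positive test among those with cancer
tab["cancer", "pos"] / sum(tab["cancer", ])
## [1] 0.8

# P(neg | not cancer): negative test among those without cancer
tab["not cancer", "neg"] / sum(tab["not cancer", ])
## [1] 0.9

# P(cancer | pos): cancer among those with a positive test
tab["cancer", "pos"] / sum(tab[, "pos"])
## [1] 0.1403509

# P(not cancer | neg): no cancer among those with a negative test
tab["not cancer", "neg"] / sum(tab[, "neg"])
## [1] 0.9954853
```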

Descriptive statistics

Data types

  • Categorical
    • Nominal: named. Ex: dead/alive, healthy/sick, WT/mutant, AA/Aa/aa, male/female, red/green/blue
    • Ordinal: named and ordered. Ex: pain (weak, moderate, severe), AA/Aa/aa, very young/young/middle age/old/very old, grade I, II, III, IV

Reported as frequencies or proportions; summarized using the mode.

  • Quantitative
    • Interval: no absolute zero, meaningful to compute interval ratios. Ex: time, temperature
    • Ratio: absolute zero, meaningful to compute ratios. Ex. height, weight, concentration

It is often not necessary to distinguish between the interval and ratio scales; it can be more useful to divide the quantitative scales into

  • Discrete: finite or countable infinite values
  • Continuous: infinitely many uncountable values

Useful summary statistics include mean, median, variance, standard deviation.

Descriptive statistics - Measures of location

Expected value

The expected value of a random variable, or the population mean, is

\[\mu = E[X] = \frac{1}{N}\displaystyle\sum_{i=1}^N x_i,\] where the sum is over all \(N\) data points in the population.

The above formula is probably the most intuitive for finite populations, but for infinite populations other definitions can be used.

For a discrete random variable:

\[\mu = E[X] = \displaystyle\sum_{k=1}^K x_k p(x_k),\]

where the sum is taken over all possible outcomes.

For a continuous random variable:

\[\mu = E[X] = \int_{-\infty}^\infty x f(x) dx\]
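For the die, the discrete formula gives the familiar mean of 3.5; a sketch in base R:

```r
# E[X] for a fair die using the discrete definition: sum of x * p(x)
x <- 1:6
p <- rep(1/6, 6)
mu <- sum(x * p)
mu
## [1] 3.5
```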

Linear transformations and combinations

\[E(aX) = a E(X)\]

\[E(X + Y) = E(X) + E(Y)\]

\[E[aX + bY] = aE[X] + bE[Y]\]

Examples of distributions, all with mean value 3.50.

Descriptive statistics - Measures of spread

Variance and standard deviation

The variance of a random variable, the population variance, is defined as

\[\sigma^2 = var(X) = E[(X-\mu)^2]\]

\[\sigma^2 = var(X) = \frac{1}{N} \sum_{i=1}^N (x_i-\mu)^2,\] where the sum is over all \(N\) data points in the population.

\[\sigma^2 = var(X) = E[(X-\mu)^2] = \left\{\begin{array}{ll} \displaystyle\sum_{k=1}^K (x_k-\mu)^2 p(x_k) & \textrm{if }X\textrm{ discrete} \\ \\ \displaystyle\int_{-\infty}^\infty (x-\mu)^2 f(x) dx & \textrm{if }X\textrm{ continuous} \end{array}\right.\]

Standard deviation

\[\sigma = \sqrt{var(X)}\]
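Continuing the die example, the variance and standard deviation with the discrete formula:

```r
# var(X) for a fair die: sum of (x - mu)^2 * p(x)
x <- 1:6
p <- rep(1/6, 6)
mu <- sum(x * p)               # 3.5

sigma2 <- sum((x - mu)^2 * p)
sigma2                          # 35/12
## [1] 2.916667

sqrt(sigma2)                    # standard deviation
## [1] 1.707825
```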

Linear transformations and combinations

\[var(aX) = a^2 var(X)\]

For independent random variables X and Y

\[var(aX + bY) = a^2var(X) + b^2var(Y)\]
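The rule for independent variables can be checked by simulation; a sketch with rnorm and assumed values a = 2, b = 3:

```r
# Check var(aX + bY) = a^2 var(X) + b^2 var(Y) by simulation
set.seed(1)
a <- 2
b <- 3
x <- rnorm(1e5, mean = 0, sd = 1)   # var(X) = 1
y <- rnorm(1e5, mean = 0, sd = 2)   # var(Y) = 4

# Theory: a^2 * 1 + b^2 * 4 = 4 + 36 = 40
var(a * x + b * y)
```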

Exercise: Data summary

Consider the below data and summarize each of the variables.

id smoker baby weight (kg) gender mother weight (kg) mother age parity married
1 yes 2.8 F 64 21 2 yes
2 yes 3.2 F 65 27 1 yes
3 yes 3.5 M 64 31 2 no
4 yes 2.7 M 73 32 0 yes
5 yes 3.3 F 59 39 3 no
6 no 3.7 M 61 26 0 yes
7 no 3.3 M 52 27 2 no
8 no 4.3 M 59 21 0 no
9 no 3.2 M 65 28 1 yes
10 no 3.0 F 73 33 4 no

Statistical inference

Draw conclusions regarding properties of a population based on observations of a random sample from the population.

Sample mean

The sample mean is denoted \(m = \bar x\). For a sample of size \(n\) the sample mean is:

\[m = \bar x = \frac{1}{n}\displaystyle\sum_{i=1}^n x_i\]

When we only have a sample of size \(n\), the sample mean \(m\) is our best estimate of the population mean. It is possible to show that the sample mean is an unbiased estimate of the population mean, i.e. the average (over many samples of size \(n\)) of the sample mean is \(\mu\).

\[E[\bar X] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \frac{1}{n} n E[X] = E[X] = \mu\]

Sample variance

The sample variance is computed as

\[s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-m)^2\]

The sample variance is an unbiased estimate of the population variance.

\[E[s^2] = \sigma^2\]
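The unbiasedness can be illustrated by simulation: averaged over many samples, the \(1/(n-1)\) estimate hits \(\sigma^2\), while a \(1/n\) version systematically underestimates it. A sketch:

```r
# Compare the 1/(n-1) and 1/n variance estimates over many samples
set.seed(1)
sigma2 <- 4      # true population variance
n <- 5
vars <- replicate(10000, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  c(unbiased = var(x),                     # 1/(n-1), R's default
    biased   = sum((x - mean(x))^2) / n)   # 1/n
})

# The unbiased estimate averages close to 4; the biased one is lower
rowMeans(vars)
```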


Normal distribution

The normal distribution (sometimes referred to as the Gaussian distribution) is a common probability distribution and many continuous random variables can be described by the normal distribution or be approximated by the normal distribution.

The normal probability density function

\[f(x) = \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2}\]

describes the distribution of a normal random variable, \(X\), with expected value \(\mu\) and standard deviation \(\sigma\). In short we write \(X \sim N(\mu, \sigma)\).

The bell-shaped normal distribution is symmetric around \(\mu\), and \(f(x) \rightarrow 0\) as \(x \rightarrow \infty\) and as \(x \rightarrow -\infty\).

As \(f(x)\) is well defined, values for the cumulative distribution function \(F(x) = \int_{- \infty}^x f(t) dt\) can be computed.

If \(X\) is normally distributed with expected value \(\mu\) and standard deviation \(\sigma\) we write:

\[X \sim N(\mu, \sigma)\]

Using transformation rules we can define

\[Z = \frac{X-\mu}{\sigma}, \, Z \sim N(0,1)\]

Values of \(F(z)\), the standard normal distribution, are tabulated (and easy to compute in R using the function pnorm).

Some values of particular interest:

\(F(1.64) = 0.95\)
\(F(1.96) = 0.975\)

As the normal distribution is symmetric:

\(F(-1.64) = 0.05\)
\(F(-1.96) = 0.025\)

\(P(-1.96 < Z < 1.96) = 0.95\)
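In R these values come directly from pnorm (and its inverse, the quantile function qnorm):

```r
pnorm(1.96)
## [1] 0.9750021
pnorm(-1.96)
## [1] 0.0249979

# P(-1.96 < Z < 1.96)
pnorm(1.96) - pnorm(-1.96)
## [1] 0.9500042

# qnorm is the inverse of pnorm
qnorm(0.975)
## [1] 1.959964
```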

Sum of two normal random variables

If \(X \sim N(\mu_1, \sigma_1)\) and \(Y \sim N(\mu_2, \sigma_2)\) are two independent normal random variables, then their sum is also a normal random variable:

\[X + Y \sim N(\mu_1 + \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2})\]

and

\[X - Y \sim N(\mu_1 - \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2})\]

Central limit theorem

The sum of \(n\) independent and identically distributed random variables is approximately normally distributed, if \(n\) is large enough.

As a result of the central limit theorem, the distribution of sample fractions or sample means is approximately normal, at least if the sample is large enough (a rule of thumb is a sample size of \(n>30\)).
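A quick simulation of the theorem, using the die (a decidedly non-normal random variable): the mean of \(n = 30\) throws is approximately normal.

```r
# Distribution of the mean of n = 30 die throws
set.seed(1)
n <- 30
means <- replicate(10000, mean(sample(1:6, n, replace = TRUE)))

mean(means)   # close to E[X] = 3.5
sd(means)     # close to sigma/sqrt(n) = 1.71/sqrt(30) = 0.31
# hist(means) shows the characteristic bell shape
```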

Example: Mean BMI

The data set fat consists of measurements for 252 men; let's take a closer look at the BMI.

## Population mean
mu <- mean(fat$BMI)
mu
## [1] 25
## Population variance
sigma2 <- var(fat$BMI)/nrow(fat)*(nrow(fat)-1)
sigma2
## [1] 13
## Population standard deviation
sigma <- sqrt(sigma2)
sigma
## [1] 3.6

Randomly sample 3, 5, 10, 15, 20, 30 men and compute the mean value, \(m\). Repeat many times to get the distribution of mean values.

Note, the mean is just the sum divided by the sample size \(n\).
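A sketch of this repeated-sampling experiment. Since the fat data set requires the UsingR package, a simulated stand-in population with the same mean and standard deviation is used here; with UsingR loaded, replace bmi by fat$BMI.

```r
# Stand-in population (mu = 25, sigma = 3.6, as for BMI in fat)
set.seed(1)
bmi <- rnorm(252, mean = 25, sd = 3.6)

# Repeatedly sample n men and compute the mean
sample_means <- function(n, reps = 10000)
  replicate(reps, mean(sample(bmi, n)))

# The spread of the sample mean decreases as n grows
sapply(c(3, 5, 10, 15, 20, 30), function(n) sd(sample_means(n)))
```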